Collaborators
- Daniel Jiao (dj2764) - Yuyang Wang (yw4660) - Xinyu Wang (xw3106) - Haoyi Tan (ht2717)

Motivation

Motor vehicle collisions are a major public health and transportation concern in New York City. Every day, hundreds of crashes occur across the five boroughs, affecting pedestrians, cyclists, motorists, and entire communities. Living in a dense and highly mobile urban environment, we constantly witness the consequences of unsafe road conditions, heavy traffic, and the complex interaction between human behavior and the built environment. These experiences motivated us to investigate the patterns and factors underlying NYC collisions more systematically.

Through this project, we aimed to combine our growing data science skills with a real-world issue that has substantial societal impact. Traffic safety is not only a transportation problem—it is a public health problem. By exploring large-scale open data from NYC, we sought to better understand when and where crashes occur, what environmental or contextual factors contribute to them, and how these factors relate to injury severity.

In particular, we were interested in how variables such as time of day, borough differences, and vehicle type may influence collision outcomes. These aspects are often discussed anecdotally—“nighttime is dangerous,” “Brooklyn has the worst traffic,” “rainy weather leads to more accidents”—but they rarely receive thorough, data-driven examination. Our motivation was to create a clear, visual, and analytical exploration of these patterns in order to turn widely held assumptions into evidence-based insights.

Ultimately, this project allowed us to apply R’s data manipulation, mapping, and modeling tools to investigate a real problem with real stakes. By analyzing NYC’s collision trends, we hope to contribute not only to our technical learning but also to broader conversations about how to make urban environments safer for everyone.

Goals & Research Questions

Understanding the patterns and determinants of motor vehicle collisions is essential for improving transportation safety in New York City. The overarching goal of this project is to translate NYC’s collision data into meaningful evidence that can support safer urban mobility. We seek to uncover when and where collisions are most likely to occur, how severe these incidents typically are, and which factors contribute most strongly to serious outcomes such as injuries and fatalities. By integrating exploratory visualization, mapping, and statistical modeling, we aim to provide a holistic and data-driven understanding of collision risks across the city. To guide our analysis, we focus on the following research questions:

  • When do collisions most frequently occur?
  • Where do collisions occur, and which boroughs are most affected?
  • How severe are collisions in New York City?
  • Which factors are associated with injuries or severe outcomes?

Data

The primary dataset used in this project is the Motor Vehicle Collisions – Crashes dataset from NYC Open Data, which provides detailed, police-reported information for each crash, including date and time, borough, geographic coordinates, injury and fatality counts, contributing factors, and vehicle types. To provide context on traffic exposure, we also reference the NYC Automated Traffic Volume Counts dataset, which contains hourly vehicle counts collected by Department of Transportation roadway sensors. Together, these datasets allow us to examine collision patterns alongside roadway usage. For analysis, we cleaned and standardized the collision data by converting timestamps, extracting relevant temporal features, harmonizing categorical variables, and removing entries with missing boroughs or invalid coordinates, resulting in a structured dataset suitable for exploring spatial, temporal, and contextual factors related to collision severity.

Data Cleaning Summary

We performed several cleaning steps to prepare the dataset for analysis:

Removed entries missing borough or geographic coordinates (representing X% of raw observations).

Standardized date/time formats and extracted features such as hour, weekday, and month.

Collapsed long-tail categories for contributing factors and vehicle types by grouping rare values (<0.5% frequency) into an “Other” category.

Removed records with coordinate errors such as (0, 0) or values outside the NYC bounding box.

Ensured consistency across duplicated entries by retaining the earliest timestamp per report. These transformations produced a clean analytical dataset with N observations.

Exploratory Analysis

Spatial Overview

mvc_clean |>
  count(borough, name = "n") |>
  ggplot(aes(x = reorder(borough, n), y = n, fill = borough)) +
  geom_col() +
  scale_fill_viridis_d() +
  coord_flip() +
  labs(
    title = "Total Collisions by Borough",
    x = "Borough",
    y = "Number of Collisions"
  ) +
  theme_minimal(base_size = 15) +
  theme(legend.position = "none")

Collisions are unevenly distributed across New York City. Brooklyn records the highest number of crashes, followed by Queens and Manhattan, reflecting their larger populations and higher traffic volumes. The Bronx shows moderately lower totals, while Staten Island has substantially fewer collisions, consistent with its smaller population density and reduced roadway congestion. These spatial differences highlight how urban form and traffic intensity contribute to collision risk across boroughs.

How severe are NYC collisions?

mvc_clean |>
  mutate(
    severity = case_when(
      number_of_persons_killed > 0 ~ "Fatal",
      number_of_persons_injured > 0 ~ "Injury",
      TRUE ~ "Property Damage Only"
    )
  ) |>
  count(severity) |>
  ggplot(aes(x = severity, y = n, fill = severity)) +
  geom_col() +
  scale_fill_viridis_d() +
  labs(
    title = "Collision Severity Distribution",
    x = NULL,
    y = "Number of Crashes"
  ) +
  theme_minimal(base_size = 15) +
  theme(legend.position = "none")

The vast majority of collisions result in property damage only, with injury crashes accounting for a smaller but still substantial proportion of events. This distribution reflects well-established traffic safety patterns: while crashes occur frequently, there’s a small fraction involve serious or life-threatening outcomes. The imbalance also underscores the importance of modeling severity as an ordinal outcome, as the categories are highly imbalanced across levels.

mvc_clean |>
  mutate(
    hour = hour(crash_time),
    severity = case_when(
      number_of_persons_killed > 0 ~ "Fatal",
      number_of_persons_injured > 0 ~ "Injury",
      TRUE ~ "PDO"
    )
  ) |>
  count(hour, severity) |>
  ggplot(aes(x = hour, y = n, color = severity)) +
  geom_line(size = 1.1) +
  scale_color_viridis_d(direction = -1) +
  scale_x_continuous(breaks = 0:23) +
  labs(
    title = "Hourly Pattern of Injuries and Fatalities",
    x = "Hour",
    y = "Count",
    color = "Severity"
  ) +
  theme_minimal(base_size = 15)

The hourly distribution of collision outcomes reveals strong diurnal patterns. Property-damage-only (PDO) crashes peak sharply during the afternoon and early evening, aligning with increases in traffic volume during commuting hours. Injury crashes follow a similar but muted pattern, with the highest counts occurring between 3 PM and 6 PM. In contrast, fatal crashes remain consistently low throughout the day, showing no pronounced hourly variation due to their rarity. These patterns highlight the strong influence of daily travel cycles on collision frequency and severity.

Temporal Patterns

mvc_clean |>
  mutate(hour = hour(crash_time)) |>
  count(hour) |>
  ggplot(aes(x = hour, y = n)) +
  geom_line(color = "#1f78b4", size = 1.2) +
  geom_point(color = "#084c7f", size = 2) +
  scale_x_continuous(breaks = 0:23) +
  labs(
    title = "Collisions by Hour of Day",
    x = "Hour",
    y = "Number of Crashes"
  ) +
  theme_minimal(base_size = 15)

Collision frequency varies substantially across the day, showing a clear diurnal rhythm. Crashes are least common in the early morning hours (2–5 AM), when road activity is lowest, and rise sharply after 7 AM with the onset of the morning commute. Counts remain elevated throughout the afternoon and peak between 3 PM and 5 PM, coinciding with the evening rush period. The decline after 7 PM reflects reduced traffic volume later in the evening. These patterns demonstrate how daily mobility cycles strongly shape collision risk.

mvc_clean |>
  mutate(weekday = wday(crash_date, label = TRUE)) |>
  count(weekday) |>
  ggplot(aes(x = weekday, y = n, fill = weekday)) +
  geom_col() +
  scale_fill_viridis_d() +
  labs(
    title = "Collisions by Day of Week",
    x = NULL,
    y = "Number of Collisions"
  ) +
  theme_minimal(base_size = 15) +
  theme(legend.position = "none")

Weekly patterns show moderately higher collision counts on weekdays, with Friday exhibiting the highest number of crashes. This pattern likely reflects increased travel demand at the end of the workweek, including both commuting and social activity. Weekends, particularly Sunday, show lower totals, consistent with reduced overall traffic. Although the variation across days is less dramatic than the hourly pattern, the results highlight predictable weekly rhythms in traffic exposure.

mvc_clean |>
  mutate(
    hour = hour(crash_time),
    weekday = wday(crash_date, label = TRUE)
  ) |>
  count(weekday, hour) |>
  ggplot(aes(x = hour, y = weekday, fill = n)) +
  geom_tile() +
  scale_fill_viridis(option = "C") +
  labs(
    title = "Heatmap of Collisions by Hour and Day of Week",
    x = "Hour of Day",
    y = "Day of Week",
    fill = "Collisions"
  ) +
  theme_minimal(base_size = 15)

The heatmap above illustrates clear temporal regularities in collision activity. In all the weekdays, crashes are concentrated between late morning and early evening, with the highest intensity occurring during the afternoon commute. Early morning hours show consistently low activity throughout the week. Weekend patterns are similar but less pronounced, reflecting reduced work-related travel. Overall, the heatmap highlights how collision risk is jointly shaped by hourly traffic cycles and weekly mobility patterns.

Contextual Overview

mvc_clean |>
  count(contributing_factor_vehicle_1) |>
  arrange(desc(n)) |>
  slice_head(n = 10) |>
  ggplot(aes(x = reorder(contributing_factor_vehicle_1, n), y = n)) +
  geom_col(fill = "#1f78b4") +
  coord_flip() +
  labs(
    title = "Top 10 Contributing Factors",
    x = "Contributing Factor",
    y = "Count"
  ) +
  theme_minimal(base_size = 15)

mvc_clean |>
  count(vehicle_type_code_1) |>
  arrange(desc(n)) |>
  slice_head(n = 10) |>
  ggplot(aes(x = reorder(vehicle_type_code_1, n), y = n)) +
  geom_col(fill = "#33a02c") +
  coord_flip() +
  labs(
    title = "Top 10 Vehicle Types in Crashes",
    x = "Vehicle Type",
    y = "Count"
  ) +
  theme_minimal(base_size = 15)

The contextual summaries highlight common factors and vehicle types involved in NYC collisions. “Unspecified” and “Driver Inattention/Distraction” are by far the most frequently reported contributing factors, underscoring the persistent challenge of distraction-related crashes. Other behaviors such as failing to yield, following too closely, and improper backing also appear prominently. In terms of vehicle type, sedans and sport utility vehicles account for the majority of crash-involved vehicles, reflecting their prevalence in the city’s vehicle fleet. The dominance of these categories suggests that both human behavior and common passenger vehicle types play central roles in collision dynamics across New York City.

Additional Analysis

To further explore the the collision in new york city, we use some heatmap to visualizes the spatial intensity of motor vehicle collisions across New York City.

set.seed(123)

mvc_sample <- mvc_clean |>
  filter(!is.na(latitude), !is.na(longitude)) |>
  slice_sample(n = 5000)

leaflet(mvc_sample) |>
  addProviderTiles(providers$CartoDB.DarkMatter) |>
  setView(lng = -73.97, lat = 40.75, zoom = 11) |>
  addHeatmap(
    lng = ~longitude,
    lat = ~latitude,
    blur = 20,
    max = 0.05,
    radius = 10
  )
mvc_sample2 <- mvc_clean |>
  mutate(
    severity = case_when(
      number_of_persons_killed > 0 ~ "Fatal",
      number_of_persons_injured > 0 ~ "Injury",
      TRUE ~ "Property Damage Only"
    )
  ) |>
  slice_sample(n = 5000)

pal <- colorFactor(
  palette = c("#7db7ff", "#2C93E8", "#FF5733"),
  domain = c("Property Damage Only", "Injury", "Fatal")
)

leaflet(mvc_sample2) |>
  addProviderTiles(providers$CartoDB.Positron) |>
  setView(lng = -73.97, lat = 40.75, zoom = 11) |>
  addCircleMarkers(
    lng = ~longitude,
    lat = ~latitude,
    radius = 3,
    stroke = FALSE,
    fillOpacity = 0.6,
    color = ~pal(severity),
    label = ~htmlEscape(severity)
  ) |>
  addLegend(
    position = "bottomright",
    pal = pal,
    values = mvc_sample2$severity,
    title = "Severity"
  )

The heatmap reveals distinct collision hotspots in Manhattan—especially Midtown and Lower Manhattan—along with notable concentrations along major corridors in Brooklyn and Queens. In contrast, Staten Island and peripheral areas of the Bronx show relatively sparse collision activity. The point-level severity map reinforces these spatial trends: injury crashes are distributed widely across the city, whereas fatal crashes cluster more prominently along high-speed arterial roads rather than dense urban centers. These patterns suggest that both urban density and roadway design play important roles in shaping the frequency and severity of collisions in New York City.

Evolution of Research Questions

As we began our exploratory analysis, our original questions—primarily descriptive—expanded in scope. Initially, we aimed only to characterize when and where collisions occur. However, early findings showing strong temporal variation and borough-level differences motivated additional questions about what factors predict injury severity. This led us to incorporate statistical modeling. Similarly, after observing clear temporal peaks in collision frequency, we added correlation analyses with traffic volume to examine whether exposure alone could explain these temporal patterns. This evolution reflects the iterative nature of data-driven research: exploratory findings guided us toward more targeted and explanatory analyses.

Modeling

We model collision severity using count-based injury outcomes. Because the number of persons injured is a non-negative count with many zeros, we compare a traditional linear regression model with a Poisson regression model, and evaluate which specification better fits the structure of the data.

linear regression

lm_mod <- lm(
  number_of_persons_injured ~ borough + hour + number_of_persons_killed,
  data = mvc_model
)

tidy(lm_mod)
## # A tibble: 7 × 5
##   term                     estimate std.error statistic   p.value
##   <chr>                       <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)               0.304   0.00193      157.   0        
## 2 boroughBROOKLYN           0.00272 0.00175        1.55 1.21e-  1
## 3 boroughMANHATTAN         -0.119   0.00187      -63.6  0        
## 4 boroughQUEENS            -0.0293  0.00180      -16.3  1.92e- 59
## 5 boroughSTATEN ISLAND     -0.0466  0.00308      -15.1  1.40e- 51
## 6 hour                      0.00325 0.0000973     33.4  5.20e-244
## 7 number_of_persons_killed  0.262   0.0143        18.3  6.61e- 75

Poisson Regresssion

pois_mod <- glm(
  number_of_persons_injured ~ borough + hour + number_of_persons_killed,
  data = mvc_model,
  family = poisson()
)

tidy(pois_mod, exponentiate = TRUE)
## # A tibble: 7 × 5
##   term                     estimate std.error statistic   p.value
##   <chr>                       <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)                 0.301  0.00503    -238.   0        
## 2 boroughBROOKLYN             1.01   0.00435       1.76 7.80e-  2
## 3 boroughMANHATTAN            0.657  0.00513     -82.0  0        
## 4 boroughQUEENS               0.916  0.00456     -19.3  2.35e- 83
## 5 boroughSTATEN ISLAND        0.866  0.00811     -17.7  3.31e- 70
## 6 hour                        1.01   0.000259     40.7  0        
## 7 number_of_persons_killed    1.65   0.0212       23.5  2.27e-122

model comparison

bind_rows(
  glance(lm_mod) |> mutate(model = "Linear"),
  glance(pois_mod) |> mutate(model = "Poisson")
)
## # A tibble: 2 × 15
##   r.squared adj.r.squared  sigma statistic p.value    df    logLik      AIC      BIC deviance
##       <dbl>         <dbl>  <dbl>     <dbl>   <dbl> <dbl>     <dbl>    <dbl>    <dbl>    <dbl>
## 1   0.00570       0.00570  0.682     1435.       0     6 -1554224. 3108465. 3108563.  697132.
## 2  NA            NA       NA           NA       NA    NA -1110034. 2220083. 2220168. 1450162.
## # ℹ 5 more variables: df.residual <int>, nobs <int>, model <chr>, null.deviance <dbl>,
## #   df.null <int>

Both the linear regression and Poisson regression models were fitted to explore predictors of injury counts in NYC motor vehicle collisions. Across both models, several consistent patterns emerge. First, the number of persons killed is a strong and statistically significant predictor of injury counts, reflecting that more severe crashes tend to involve both fatalities and a higher number of non-fatal injuries. The effect of hour of day is also substantial: each additional hour is associated with higher expected injury counts, with the Poisson model estimating approximately a 1.01–1.02 multiplicative increase per hour, consistent with elevated activity during the afternoon and evening periods.

Spatial variation is notable across boroughs. In the linear model, Manhattan shows lower injury counts relative to the reference borough, while Staten Island and Brooklyn show slightly higher injury counts. The Poisson model produces similar directional effects, though with multiplicative interpretations: for example, crashes in Queens or Staten Island are associated with substantially higher expected injury counts, while Manhattan’s injury rate is significantly lower. These borough-level differences likely reflect differences in traffic volume, urban density, and roadway environments.

When comparing overall model fit, the Poisson model is more appropriate for count data and displays substantially improved goodness-of-fit metrics relative to the linear model, including far higher log-likelihood and lower AIC/BIC values. This suggests that injury counts exhibit overdispersion and non-normality that are better captured through a count-based modeling framework. Together, these findings indicate that temporal, geographic, and crash-severity factors jointly shape the distribution of injuries in NYC collisions, and that Poisson regression provides a more suitable modeling approach for this outcome.

Residual Diagnostics

mvc_model |>
  add_residuals(lm_mod, var = "resid") |>
  add_predictions(lm_mod, var = "pred") |>
  ggplot(aes(x = pred, y = resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(
    title = "Linear Model: Residuals vs Fitted",
    x = "Fitted Values",
    y = "Residuals"
  ) +
  theme_minimal(base_size = 14)

mvc_model |>
  add_residuals(pois_mod, var = "resid") |>
  add_predictions(pois_mod, var = "pred") |>
  ggplot(aes(x = pred, y = resid)) +
  geom_point(alpha = 0.3, color = "#2C93E8") +
  geom_hline(yintercept = 0, linetype = 2) +
  labs(
    title = "Poisson Model Residuals vs Fitted",
    x = "Fitted Values",
    y = "Residuals"
  ) +
  theme_minimal(base_size = 14)

Residual diagnostic plots reveal clear differences in model adequacy between the linear and Poisson regressions. The linear model shows strong heteroskedasticity, with residuals tightly clustered near zero for low fitted values but spreading widely as fitted values increase, indicating violation of constant variance and non-normal error assumptions. In contrast, the Poisson model exhibits a more stable residual pattern, with residuals remaining centered around zero and displaying variance patterns more consistent with count data. Although some dispersion remains, the Poisson model aligns substantially better with the underlying distribution of injuries, supporting its use as the preferred model for this outcome.

Modeling Rationale

Predictors were selected based on domain relevance and findings from exploratory analysis. Hour of day captures temporal exposure differences. Borough controls for spatial variation in population density and roadway structure. The number of persons killed indicates crash severity and is theoretically linked to injury likelihood. We compared linear and Poisson regression because the outcome (injury count) is a non-negative count with excess zeros and right-skewness.

Correlation

We also conducted a correlation analysis and found that traffic volume has only a very weak linear association with crash counts in New York City. Citywide, the correlation between hourly traffic volume and collisions is approximately r ≈ 0.10, suggesting that higher-volume periods tend to experience slightly more crashes, but the relationship is minimal and explains very little of the variation in crash frequency. When examining boroughs individually, correlations remain weak across all areas—ranging from essentially no relationship in Staten Island (r ≈ 0.02) to slightly stronger but still weak associations in Queens, the Bronx, and Manhattan (generally r ≈ 0.07–0.16). These findings indicate that traffic volume alone is insufficient to explain spatial or temporal crash patterns, and that other factors—such as road design, intersection characteristics, and driver behavior—likely play a much more substantial role in shaping collision risk.

Discussion

First, the temporal analyses consistently show strong daily and weekly patterns. Crashes peak during the morning and especially the evening commute, with Friday exhibiting the highest weekly counts. Late-night and early-morning hours display markedly lower activity. These results confirm that collision risk is closely aligned with predictable mobility cycles and traffic density.

Secondly, the spatial analyses reveal substantial geographic variation across the five boroughs. Brooklyn and Queens experience the highest number of crashes, followed by Manhattan, while Staten Island records significantly fewer incidents. Heatmaps further demonstrate collision hotspots in dense commercial areas such as Midtown and Lower Manhattan, as well as along major arterial corridors in Brooklyn and Queens, highlighting the influence of urban density and roadway design on collision concentration.

Third, our examination of collision severity indicates that most crashes involve property damage only, with injury crashes forming a smaller but meaningful portion of events. Fatal collisions are rare and tend to cluster along high-speed roadways rather than within dense urban centers. This distribution underscores that while crashes are frequent, severe outcomes remain relatively uncommon.

Finally, our modeling work offers insight into which factors are associated with injury outcomes. Both the linear and Poisson models identify hour of day, borough, and the number of persons killed as significant predictors of injury counts. The Poisson model, which provides a superior fit for count data, demonstrates that injuries are more likely during high-traffic periods and in certain boroughs, while Manhattan consistently shows lower injury risk after adjusting for crash characteristics. These patterns reflect how contextual and environmental factors—such as traffic flow, population density, driver behavior, and roadway structure—interact to shape injury severity.

Together, these findings paint a cohesive picture of collision dynamics in New York City: crashes are most common during peak travel times, concentrated in high-density regions, predominantly minor in severity, and influenced by a combination of temporal, spatial, and behavioral factors. By integrating exploratory visualization, spatial mapping, and statistical modeling, this project provides a comprehensive and evidence-based perspective on traffic safety across the city.

Limitations

Several limitations should be noted. First, police-reported contributing factors are subject to reporting bias and often coded as “Unspecified,” reducing reliability. Second, our modeling does not include weather, roadway geometry, or traffic signalization data, all of which can strongly influence crash risk. Third, injury counts contain many zeros and may exhibit overdispersion beyond what Poisson models capture. Finally, traffic volume data come from a limited set of roadway sensors and may not fully represent actual citywide exposure.

Future Work

Future analyses could incorporate spatial statistical models or geographically weighted regression to better quantify local risk factors. Integrating weather data, intersection characteristics, and speed limit changes would also improve explanatory power. Machine learning approaches such as random forests or gradient boosting could help identify nonlinear interactions among predictors. Finally, linking to hospitalization or EMS records could enable deeper study of injury severity beyond counts.